
Update Gemma2 attention scale #694

Merged: 2 commits into TransformerLensOrg:dev on Aug 11, 2024

Conversation

@mntss (Contributor) commented on Aug 7, 2024

Description

The current configuration uses an incorrect attention scale. According to the DeepMind implementation, the 2b and 9b versions use sqrt(d_head), which is the default scale in TransformerLens:
https://github.com/google-deepmind/gemma/blob/a0504162f99a1c238efb37b8197e711c0f3808fd/gemma/transformer.py#L152-L174

Fixes #693
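Concretely, a minimal sketch of what the corrected scaling means (illustrative values only, not the literal diff; it assumes d_head = 256 for these models):

```python
import math

# Gemma-2 2b and 9b use a head dimension of 256 (assumption based on the
# DeepMind reference above), so the attention logits should be scaled by
# sqrt(d_head) rather than a model-specific constant.
d_head = 256
attn_scale = math.sqrt(d_head)  # 16.0

# In TransformerLens terms, attention scores are computed as
#   scores = (q @ k.transpose(-2, -1)) / cfg.attn_scale
# so cfg.attn_scale should equal sqrt(cfg.d_head) for these models.
```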

Type of change


  • Bug fix (non-breaking change which fixes an issue)


Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@neelnanda-io (Collaborator) commented:

Huh, you seem to be correct, my bad. I'm not sure why my sanity checks didn't show an enormous divergence.
Can you check whether the attention patterns are notably closer to HuggingFace after this change?

@mntss (Contributor, Author) commented on Aug 7, 2024

The HF implementation uses SDPA, so I'd need to switch the attention implementation to access the pattern. I compared the output (hook_z) across layers instead:

[Image: per-layer comparison of hook_z outputs between TransformerLens and HuggingFace]
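For anyone who wants to reproduce this kind of check, a rough sketch of caching hook_z per layer in TransformerLens (the model name and prompt are assumptions; the HF-side values would need to be extracted separately, e.g. via torch forward hooks on the corresponding attention modules):

```python
from transformer_lens import HookedTransformer

# Load Gemma-2 2b in TransformerLens (model name assumed; pick the variant you need).
model = HookedTransformer.from_pretrained("google/gemma-2-2b")
tokens = model.to_tokens("The capital of France is")

# Cache all activations and read the per-layer attention head outputs (hook_z).
_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    z = cache["z", layer]  # shape: [batch, pos, n_heads, d_head]
    print(f"layer {layer}: hook_z norm = {z.norm().item():.3f}")

# Comparing these tensors against the equivalent activations from the HuggingFace
# model gives the per-layer divergence summarized in the plot above.
```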

@ccp123456 commented:

This issue is very helpful for me, and I have also been confused by it! How do I adjust TransformerLens' Gemma so that it gives the same results as HF? Which file should I add the above code to?

bryce13950 changed the base branch from main to dev on August 11, 2024 at 22:59
bryce13950 changed the base branch from dev to main on August 11, 2024 at 23:00
bryce13950 changed the base branch from main to dev on August 11, 2024 at 23:00
bryce13950 merged commit e30f96b into TransformerLensOrg:dev on Aug 11, 2024
11 checks passed
@bryce13950 (Collaborator) commented:

@mntss Thank you very much for this! @ccp123456 this will be included in a release relatively quickly, so you should be able to make use of it right away. The question of 100% accuracy is a bit more complicated, and making sure the Gemma models are 100% accurate is a high-priority task at the moment. If you are interested in knowing more, DM me on Slack. (If you are not on the Slack channel, let me know and I can give you access.)
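For the question above about matching HF, one possible sanity check (a sketch, with the model name and prompt as assumptions; small numerical differences are still expected from weight processing and dtype) is to compare the final logits directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformer_lens import HookedTransformer

name = "google/gemma-2-2b-it"  # assumed model; use whichever variant you care about
tokenizer = AutoTokenizer.from_pretrained(name)
hf_model = AutoModelForCausalLM.from_pretrained(name)
tl_model = HookedTransformer.from_pretrained(name)

tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tl_logits = tl_model(tokens)

# With the corrected attention scale the gap should be small; a large gap would
# point to a remaining configuration mismatch.
print((hf_logits - tl_logits).abs().max().item())
```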

Linked issue (may be closed by merging this pull request): [Bug Report] Gemma-2-2b-it output logit doesn't match with huggingface